DISCUSSION: A TALE OF THREE COUSINS: LASSO,
L2BOOSTING AND DANTZIG

By N. Meinshausen, G. Rocha and B. Yu
University of California, Berkeley 

We would like to congratulate the authors for their thought-provoking 
and interesting paper. The Dantzig paper is on the timely topic of high- 
dimensional data modeling that has been the center of much research lately 
and where many exciting results have been obtained. It also falls in the 
very hot area at the interface of statistics and optimization: $\ell_1$-constrained
minimization in linear models for computationally efficient model selection, 
or sparse model estimation (Chen, Donoho and Saunders [5] and Tibshirani 
[17]). The sparsity consideration indicates a trend in high-dimensional data 
modeling advancing from prediction, the hallmark of machine learning, to 
sparsity, a proxy for interpretability. This trend has been greatly fueled by
the participation of statisticians in machine learning research. In particular, 
Lasso (Tibshirani [17]) is the focus of many sparsity studies in terms of 
both theoretical analysis (Knight and Fu [10], Greenshtein and Ritov [9], 
van de Geer [19], Bunea, Tsybakov and Wegkamp [3], Meinshausen and 
Bühlmann [13], Zhao and Yu [23] and Wainwright [20]) and fast algorithm
development (Osborne, Presnell and Turlach [15] and Efron et al. [8]). 

Given $n$ units of data $Z_i = (X_i, Y_i)$ with $Y_i \in \mathbb{R}$ and $X_i \in \mathbb{R}^p$ for $i =
1, \ldots, n$, let $Y = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^n$ be the continuous vector response vari-
able and $X = (X_1, \ldots, X_n)^T$ the $n \times p$ design matrix and let the columns
of $X$ be normalized to have $\ell_2$-norm 1. It is often useful to assume a linear
regression model,

(1) $Y = X\beta + \varepsilon$,

where $\varepsilon$ is an i.i.d. $N(0, \sigma^2)$ vector of size $n$.
Lasso minimizes the $\ell_1$-norm of the parameters subject to a constraint on
squared error loss. That is, $\hat\beta^{lasso}(t)$ solves the $\ell_1$-constrained minimization
problem

(2) $\min_\beta \|\beta\|_1$ subject to $\frac{1}{2}\|Y - X\beta\|_2^2 \le t$.

We can clearly use constraint and objective function interchangeably. For
each value of $t > 0$, one can also find a value of the Lagrange multiplier $\lambda$ so
that Lasso is the solution of the penalized version

(3) $\min_\beta \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$.

Finally, it is well known that an alternative form of Lasso (Osborne, Presnell
and Turlach [15]) asserts that $\hat\beta^{lasso}$ also solves

(4) $\min_\beta \frac{1}{2}\beta^T X^T X \beta$ subject to $\|X^T(Y - X\beta)\|_\infty \le \lambda$,

where $\lambda$ is identical to the penalty parameter in the penalized version (3). In
what follows, we consider Dantzig estimates $\hat\beta^{dantzig}_\lambda$ solving the constrained
minimization problem

(5) $\min_\beta \|\beta\|_1$ subject to $\|X^T(Y - X\beta)\|_\infty \le \lambda$.

The Dantzig selector as proposed by the authors uses $\lambda = \lambda_p(\sigma) = \sigma\sqrt{2\log p}$.
To distinguish the two, we reserve the term Dantzig selector for this particular
choice of $\lambda$ throughout this discussion. Comparing Dantzig with Lasso in its
forms (4) and (5) reveals very clearly their close kinship. Hence we would like 
to view the Dantzig paper in the context of the vast literature on Lasso. We 
will start with some comments on the theoretical side before concentrating 
on comparing Dantzig and Lasso from the points of view of algorithmic and 
statistical performance. 
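
As a small illustration of this kinship, consider the special case of an orthonormal design ($X^T X = I$), an assumption made only for this sketch. Both the penalized Lasso and the Dantzig problem then separate by coordinate, and each is solved by soft-thresholding $z = X^T Y$ at level $\lambda$. The function names and the numbers below are hypothetical and chosen for illustration.

```python
import math

def dantzig_threshold(sigma, p):
    """Fixed tuning parameter lambda_p(sigma) = sigma * sqrt(2 log p)."""
    return sigma * math.sqrt(2.0 * math.log(p))

def soft_threshold(z, lam):
    """Coordinatewise soft-thresholding of the list z at level lam."""
    out = []
    for v in z:
        if abs(v) > lam:
            out.append((abs(v) - lam) * (1.0 if v > 0 else -1.0))
        else:
            out.append(0.0)
    return out

# z plays the role of X^T Y under the assumed orthonormal design.
z = [3.0, -0.5, 1.5]
lam = 1.0
beta = soft_threshold(z, lam)

# Dantzig feasibility: every |z_k - beta_k| is at most lam.
feasible = all(abs(zk - bk) <= lam + 1e-12 for zk, bk in zip(z, beta))

# Lasso optimality on the active set: z_k - beta_k = lam * sign(beta_k).
active_ok = all(abs((zk - bk) - math.copysign(lam, bk)) < 1e-12
                for zk, bk in zip(z, beta) if bk != 0.0)
```

In this toy case the same vector satisfies the Dantzig constraint with the smallest possible $\ell_1$-norm and meets the Lasso optimality conditions, so the two estimators coincide.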

1. Lasso and Dantzig: theoretical results. Assuming $\sigma$ is known, the
Dantzig selector uses a fixed tuning parameter $\lambda_p(\sigma)$. Under a condition
called Uniform Uncertainty Principle (which requires almost orthonormal
predictors when choosing subsets of variables), an effective bound is obtained
on the MSE $\|\hat\beta^{dantzig}_\lambda - \beta\|_2^2$ for the Dantzig selector. After a simple step of
bounding the projected errors on the predictors, the proof is deterministic.
This bounding step gives rise to the particular chosen threshold $\lambda_p(\sigma)$. In
terms of tools used, this paper is closely related to earlier papers by the 
authors, Donoho, Elad and Temlyakov [7] and Donoho [6] on Lasso. 

There is a parallel development of understanding Lasso under the linear 
regression model in (1) with stochastic tools. The results are in terms of the 
$\ell_2$-MSE on $\beta$ and also in terms of the $\ell_2$-MSE on the regression function $X\beta$
(e.g., Greenshtein and Ritov [9], Bunea, Tsybakov and Wegkamp [3], van de
Geer [19], Zhang and Huang [22] and Meinshausen and Yu [14]). Related
results for L2Boosting are obtained by Bühlmann [2]. Since Lasso is important
for its model selection property, it is natural to study directly Lasso's model 
selection consistency as in the work of Meinshausen and Bühlmann [13],
Zhao and Yu [23], Zou [24], Wainwright [20, 21] and Tropp [18]. What has 
emerged from this cluster of work is the necessity of an irrepresentable con- 
dition for Lasso to select the correct variables under sparsity conditions on 
the model. This condition regulates how correlated the predictors can be 
before wrong predictors are selected. However, this condition can be relaxed 
and Lasso still behaves sensibly. Specifically, Meinshausen and Yu [14] and 
Zhang and Huang [22] assume less restricted conditions on the predictors 
than the UUP condition to derive a bound on the same MSE (on $\beta$) for an
arbitrary $\lambda$. The bound is probably weaker than the Dantzig bound, but
the assumptions are weaker as well so it covers commonly occurring highly 
correlated predictors. It is a consequence of this bound that in the case of 
$p > n$, if the model is sparse, Lasso can significantly reduce the number of
predictors while keeping the correct ones. It would be interesting to see the 
Dantzig bound generalized to the case of more correlated predictors and for 
a range of $\lambda$'s since $\sigma$ is mostly unknown in practice and has to be estimated.

2. Lasso and Dantzig: algorithm and performance. The similarities of 
Lasso and Dantzig revealed in (4) and (5) beg us to ask: How does Dantzig 
differ from Lasso? Which one should one use in practice and why? Let us 
start with a simple case where geometric visualizations of Dantzig and Lasso 
optimization problems can be easily displayed. 

Lasso versus Dantzig: p = 3 and in the population limit. We choose three 
predictors from the multivariate normal distribution with a zero mean vector 
and a covariance matrix $V$ with a unit diagonal and entries $V_{12} = 0$ and
$V_{13} = V_{23} = r$, where $|r| < 1/\sqrt{2}$ to guarantee positive definiteness of $V$.
For simplicity, we consider the case of $n = \infty$, so we have zero noise and
the population covariance $V$. We do this by setting the observations to be
$Y = X\beta^*$, with $\beta^* = (1, 1, 0)$ and $X$ given by the Cholesky decomposition of
$V$ so $X'X = V$. For the purpose of visualization, we rewrite the minimization
problems in (2) and (5) in the alternative forms

(6) $\min_\beta \|Y - X\beta\|_2$ subject to $\|\beta\|_1 \le t$, for Lasso;

(7) $\min_\beta \|X'(Y - X\beta)\|_\infty$ subject to $\|\beta\|_1 \le t$, for Dantzig.

In Figure 1, we display six plots of these alternative minimization problems. 
In the two leftmost columns, the $\ell_1$-polytopes sitting at the origin give the
Fig. 1. The panels in the first row and second row refer respectively to Lasso and Dantzig.
The geometry in the $\beta$ space for the optimization problems (6) and (7) is shown for the
uncorrelated design (leftmost panel) and for correlated design with $r = 0.5$ (middle panel).
The Lasso solution is the point where the ellipsoid of $\ell_2$-loss touches the $\ell_1$-polytope and
is unique in both cases. For Dantzig, the solutions are given by the points touching the
$\ell_1$-polytope and the box-shaped $\ell_\infty$-constraint on the correlations of the predictor variables
with the residuals. For $r = 0.5$, the solution is not uniquely determined for Dantzig as the
side of the box aligns with the surface of the $\ell_1$-polytope. The rightmost column shows
the third component $\hat\beta_3$ of the respective solution as a function of the correlation $r$ and
the regularization parameter $\lambda$ as defined in (4) and (5). The Dantzig solution is not
continuous at $r = 0.5$.



same $\ell_1$-constraint $\|\beta\|_1 \le 1$. The touching ball or ellipsoids in the first row
correspond to the $\ell_2$-objective function for the Lasso, while the cube
and polytopes in the second row correspond to the $\ell_\infty$-objective function for
Dantzig. In the first column of the plots, $r = 0$ and both Lasso and Dantzig
correctly select only the first two variables. In the second column, we set the
correlation at $r = 0.5$. The Lasso still correctly selects only the first two vari-
ables. Meanwhile, the Dantzig admits multiple solutions, namely all points
belonging to the line connecting $(0, 0, 1)$ and $(\frac{1}{2}, \frac{1}{2}, 0)$. While it is true that
$(\frac{1}{2}, \frac{1}{2}, 0)$ is one of the Dantzig solutions correctly selecting the first two variables
and discarding the third, all other solutions incorrectly include the third 
variable. In the other extreme, (0, 0, 1) is also a solution where the first and 
second variables are wiped out from the model and only the third is added. 
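
The non-uniqueness at $r = 0.5$ can be checked numerically. The following is a minimal sketch that uses only the population quantities: since $Y = X\beta^*$ and $X'X = V$, the Dantzig objective in (7) equals $\|V(\beta^* - \beta)\|_\infty$. The helper names are hypothetical, and the comparison value $r = 0.3$ is our own choice for illustration.

```python
def gram(r):
    """Population covariance V with V12 = 0 and V13 = V23 = r."""
    return [[1.0, 0.0, r], [0.0, 1.0, r], [r, r, 1.0]]

def dantzig_objective(beta, r, beta_star=(1.0, 1.0, 0.0)):
    """||X'(Y - X beta)||_inf, computed as ||V (beta* - beta)||_inf."""
    V = gram(r)
    d = [bs - b for bs, b in zip(beta_star, beta)]
    return max(abs(sum(V[i][k] * d[k] for k in range(3))) for i in range(3))

# Points on the segment between (0, 0, 1) and (1/2, 1/2, 0).
# Every point on this segment has l1-norm exactly 1.
segment = [(a / 2.0, a / 2.0, 1.0 - a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]

values_at_half = [dantzig_objective(b, 0.5) for b in segment]
values_at_03 = [dantzig_objective(b, 0.3) for b in segment]
```

At $r = 0.5$ the objective takes the same value at every point of the segment, which is consistent with the multiple solutions described above. At $r = 0.3$ the objective strictly decreases toward $(\frac{1}{2}, \frac{1}{2}, 0)$ along the segment, so the tie disappears.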

In this example, r = 0.5 is a critical point where the irrepresentable con- 
dition (Zhao and Yu [23]) breaks down. The transition from below 0.5 to 
above can be seen in the third column of Figure 1, which depicts the contour 
plots of the $\hat\beta_3$ estimated by Lasso and Dantzig: $r$ varies from 0.35 to 0.70
along the vertical direction and each horizontal line shows the whole path as
a function of $\lambda$ in the optimization problems (4) and (5) for a fixed $r$. When
$r > 0.5$, both Lasso and Dantzig systematically select the wrong third
predictor (that is, the estimated $\hat\beta_3$'s are nonzero). In terms of size of the incorrectly
added coefficient, however, the transition is much sharper for Dantzig as r 
crosses 0.5. In fact, the solution of the Dantzig is not a Lipschitz continuous 
function of the observations for r = 0.5. This could be expected, as Dantzig 
is the solution of a linear program (LP) problem and the estimator can thus 
jump from one vertex in the box to another if the data changes slightly. 
When $\lambda$ varies, the regularization path for the Dantzig is piecewise linear.
However, the flat faces of both the loss and the penalty functions can cause
jumps in the path, similarly to what happens in the $\ell_1$-penalized quantile
regression (Rosset and Zhu [16]). This makes the design of an algorithm in 
the spirit of the homotopy/LARS-LASSO algorithm for the Lasso (Osborne, 
Presnell and Turlach [15] and Efron et al. [8]) more challenging and gives rise 
to jittery paths relative to Lasso and L2Boosting, as seen in the simulated 
example below. 
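
Returning to the critical value $r = 0.5$ noted above, a short worked check may help. It uses the irrepresentable condition in the form we recall from Zhao and Yu [23]; the index-set notation $S$ for the truly nonzero coefficients is introduced here only for illustration.

```latex
\[
S = \{1, 2\}, \qquad
V_{SS} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
V_{3S} = (r, \; r), \qquad
\operatorname{sign}(\beta^*_S) = (1, \; 1)^T,
\]
\[
\bigl| V_{3S} \, V_{SS}^{-1} \operatorname{sign}(\beta^*_S) \bigr| = |2r| < 1
\quad \Longleftrightarrow \quad |r| < \tfrac{1}{2}.
\]
```

Under this reading, the condition holds for $|r| < 0.5$ and fails once $r$ reaches $0.5$, matching the transition seen in the third column of Figure 1.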

The first column of Figure 1 suggests that Lasso and Dantzig could coin- 
cide. At the very least, their regularization paths share the same terminal 
points given by the minimal $\ell_1$-norm vector of coefficients, causing the corre-
lation of all predictors with the residuals to be zero. In fact, more similarities
exist: we now provide a sufficient condition for the two paths to entirely agree
when $n > p$. The condition is diagonal dominance of $(X^T X)^{-1}$, that is, for
$M = (X^T X)^{-1}$,

(8) $M_{jj} > \sum_{i \ne j} |M_{ij}|$ for all $j = 1, \ldots, p$.

When p = 2, condition (8) is always satisfied so Lasso is exactly the same as 
Dantzig (and L2Boosting). Moreover, the irrepresentable condition is always 
satisfied as well. The diagonal dominance condition (8) is related to the 
positive cone condition used in Efron et al. [8] to show that L2Boosting 
and Lasso share the same path. The positive cone condition requires, for 
all subsets $A \subset \{1, \ldots, p\}$ of variables, that $M_{jj} > -\sum_{i \ne j} M_{ij}$, where $M =
(X_A^T X_A)^{-1}$, and is always trivially satisfied for $p = 2$.
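
Condition (8) is easy to check numerically for a given Gram matrix. The sketch below is a plain-Python illustration with hypothetical helper names. The three-predictor matrix from the population example is used here only as a test input of our own choosing; it is not a computation reported in this discussion.

```python
def invert(a):
    """Invert a small square matrix (list of lists) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for i in range(n):
            if i != col:
                factor = aug[i][col]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]

def is_diagonally_dominant(gram):
    """Check condition (8) for M = gram^{-1}: M_jj > sum_{i != j} |M_ij|."""
    m = invert(gram)
    p = len(m)
    return all(m[j][j] > sum(abs(m[i][j]) for i in range(p) if i != j)
               for j in range(p))

def three_predictor_gram(r):
    """Gram matrix with unit diagonal, V12 = 0 and V13 = V23 = r."""
    return [[1.0, 0.0, r], [0.0, 1.0, r], [r, r, 1.0]]

two_predictors = is_diagonally_dominant([[1.0, 0.9], [0.9, 1.0]])
weak_correlation = is_diagonally_dominant(three_predictor_gram(0.3))
strong_correlation = is_diagonally_dominant(three_predictor_gram(0.6))
```

For two predictors the check passes even at correlation 0.9, in line with the remark that (8) always holds when $p = 2$. For the three-predictor input it passes at $r = 0.3$ and fails at $r = 0.6$.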

Theorem 1. Under the diagonal dominance condition (8), the Lasso
solution (3) and the Dantzig solution (5) are identical for any value of $\lambda > 0$
(Lasso and Dantzig share the same path).

Proof. First, define the vector $g(\beta) = X^T(Y - X\beta) \in \mathbb{R}^p$ containing the
correlation of the residuals with the original predictor variables. The Lasso
solution is unique under condition (8). A necessary and sufficient condition
for a vector $\beta$ to be the Lasso solution is, by the Karush-Kuhn-Tucker
conditions (Bertsekas [1]), that (a) for all $k$: $g_k(\beta) \in [-\lambda, \lambda]$ and (b) for all
$k \in \{l : \beta_l \ne 0\}$ it holds that $g_k(\beta) = \lambda\,\mathrm{sign}(\beta_k)$. We show that the Dantzig
solution (5) is a valid Lasso solution under diagonal dominance (8). The
Dantzig fulfills condition (a) by construction.

We now show that the (unique) Dantzig solution also satisfies (b). Assume
to the contrary that $\beta$ is a solution of the Dantzig and there is some $j \in
\{k : \beta_k \ne 0\}$ such that $g_j(\beta) \in [-\lambda, \lambda]$ but $g_j(\beta) \ne \lambda\,\mathrm{sign}(\beta_j)$. Let $\delta \in \mathbb{R}^p$ be a
vector with $\delta_k = 0$ for all $k \ne j$ and $\delta_j = \mathrm{sign}(\beta_j)$ and define $\gamma = -(X^T X)^{-1}\delta$.
We have $g(\beta + \nu\gamma) = g(\beta) + \nu\delta$, so only the jth component of the vector of
correlations is changed by an amount $\nu\,\mathrm{sign}(\beta_j)$. Since we have assumed
$|g_j(\beta)| < \lambda$, there exists some $\nu > 0$ such that $\beta + \nu\gamma$ is still in the feasible
region.

To complete the proof we now show that, under the diagonal dominance
condition (8), the $\ell_1$-norm of $\beta + \nu\gamma$ will be smaller than the $\ell_1$-norm of
$\beta$ for small values of $\nu$. Denote by $\beta_{-j}$ the vector with entries identical to $\beta$,
except for the jth component, which is set to zero. We can write

$\|\beta + \nu\gamma\|_1 \le \|\beta_{-j}\|_1 + \nu\|\gamma_{-j}\|_1 + |\beta_j + \nu\gamma_j|$

$\le \|\beta_{-j}\|_1 + \nu\sum_{k \ne j} |M_{kj}| + |\beta_j| - \nu M_{jj}$,

where the first inequality results from using the triangle inequality twice and
the second inequality stems from $\gamma_k = -M_{kj}\,\mathrm{sign}(\beta_j)$ with $M = (X^T X)^{-1}$.
It thus holds that, for small enough values of $\nu > 0$, the right-hand side
is smaller than $\|\beta\|_1$ under the diagonal dominance condition (8). Hence,
the vector $\beta$ with $g_j(\beta) \ne \lambda\,\mathrm{sign}(\beta_j)$ cannot be the Dantzig solution. We
conclude that the Dantzig solution must satisfy properties (a) and (b) and
thus coincides with the Lasso solution (3). □

As alluded to earlier, the Dantzig selector needs the true $\sigma$ to be applied
to real-world data. One obvious alternative is to use the Dantzig path and
cross-validation. This gives another reason for obtaining the whole path.
We define our data-driven Dantzig selector (DD) by computing $\hat\sigma^2_{CV}$, the
smallest fivefold cross-validated mean squared error over the Dantzig path,
and plugging it into $\lambda_p(\hat\sigma_{CV})$. Needless to say, this estimator is not without
its problems: one being that the cross-validated error might not be a good
estimate of the prediction error in the $p > n$ case and the other that it
might overestimate $\sigma^2$. However, we decide to use it because it is sensible
and simple. We later compare the performance of the data-driven Dantzig
selector with the Dantzig estimator corresponding to the $\lambda_{CV}$ chosen as the
minimizer of the cross-validated mean squared error.
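
The two tuning rules just described can be sketched as follows, assuming a cross-validated MSE curve has already been computed on a grid of regularization values. The function names are hypothetical and the numbers are made up purely to show the mechanics; they are not results from this discussion.

```python
import math

def data_driven_lambda(cv_mse, p):
    """lambda_p(sigma_hat) where sigma_hat^2 is the smallest CV error
    over the path (the data-driven rule, DD)."""
    sigma_hat = math.sqrt(min(cv_mse))
    return sigma_hat * math.sqrt(2.0 * math.log(p))

def cv_lambda(lambdas, cv_mse):
    """The grid value that minimizes the cross-validated MSE (the CV rule)."""
    best = min(range(len(cv_mse)), key=lambda i: cv_mse[i])
    return lambdas[best]

# Illustrative inputs only: a grid of lambda values and a CV error curve.
lambdas = [0.05, 0.1, 0.2, 0.4, 0.8]
cv_mse = [0.30, 0.12, 0.09, 0.16, 0.50]

lam_dd = data_driven_lambda(cv_mse, p=60)
lam_cv = cv_lambda(lambdas, cv_mse)
```

With these made-up inputs the data-driven value is several times larger than the cross-validated one, which illustrates how the plug-in rule can end up more conservative than direct minimization of the CV error.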
A more realistic simulation example is in order for further comparisons
of Lasso and Dantzig. The following simulation example reflects the com-
mon $p > n$ situation seen in recent real-world data applications. L2Boosting,
Lasso and Dantzig will be contrasted against each other in terms of algo-
rithmic and performance behavior. Path smoothness will be examined and
statistical performance criteria include MSE on $\beta$, MSE on the regres-
sion function $X\beta$ and a variable selection quality plot (i.e., correctly selected
variables relative to falsely selected variables). In addition, we vary the sig- 
nal to noise ratio and correlation level of the predictors to bring out more 
insight. 
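
The three performance criteria can be written down compactly. This is a minimal sketch with hypothetical function names. Dividing the fit error by the number of observations is an assumption made here, since the normalization used for the figures is not stated.

```python
def mse_beta(beta_hat, beta_star):
    """Squared l2 error on the coefficient vector."""
    return sum((bh - bs) ** 2 for bh, bs in zip(beta_hat, beta_star))

def mse_fit(X, beta_hat, beta_star):
    """Average squared error on the regression function X beta
    (the division by the number of rows is an assumed normalization)."""
    diffs = [sum(x * (bh - bs) for x, bh, bs in zip(row, beta_hat, beta_star))
             for row in X]
    return sum(d * d for d in diffs) / len(X)

def selection_counts(beta_hat, beta_star, tol=0.0):
    """Return (correctly selected, falsely selected) variable counts."""
    correct = sum(1 for bh, bs in zip(beta_hat, beta_star)
                  if abs(bh) > tol and bs != 0.0)
    false = sum(1 for bh, bs in zip(beta_hat, beta_star)
                if abs(bh) > tol and bs == 0.0)
    return correct, false

# Tiny illustrative inputs, not taken from the simulation study.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
beta_star = [1.0, 0.0, -1.0]
beta_hat = [0.5, 0.25, 0.0]

err_beta = mse_beta(beta_hat, beta_star)
err_fit = mse_fit(X, beta_hat, beta_star)
counts = selection_counts(beta_hat, beta_star)
```

Plotting the first entry of `counts` against the second along a regularization path gives a curve of the kind used for the variable selection quality plot.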

Lasso, L2Boosting and Dantzig: $p > n$ and correlated predictors. We con-
sider random design with $p = 60$ variables and $n = 40$. Predictor variables
have a multivariate Gaussian distribution $X \sim \mathcal{N}(0, \Sigma)$, where the popu-
lation covariance matrix $\Sigma$ of the predictor variables is Toeplitz, that is,
$\Sigma_{ij} = \rho^{|i-j|}$ for all $1 \le i, j \le p$. The response vector $Y$ is obtained as in (1),

(9) $Y = X\beta^* + \sigma\varepsilon$,

where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ is i.i.d. noise with a standard Gaussian distribution.
The $p$-dimensional vector $\beta^*$ is drawn once from a standard Gaussian dis-
tribution and all but 10 randomly selected coefficients are set to zero. To be
precise, the true parameter vector $\beta^*$ used has entries

-0.65, -0.38, -0.37, -0.27, -0.12, -0.08, 0.05, 0.24, 0.37, 0.41,

for components 60, 2, 21, 49, 20, 27, 4, 43, 51, 32, with all other components
set to zero. Three simulation setups are (a) $\rho = 0$, $\sigma = 0.2$; (b) $\rho = 0.9$, $\sigma =
0.2$; (c) $\rho = 0.9$, $\sigma = 0.6$. The vector $\beta^*$ is rescaled in each case so that
$\|X\beta^*\|_2^2 = n$. We do not include the case that $\rho = 0$ and $\sigma = 0.6$ for the
results are similar to (a). 
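
A data-generating sketch for this design is given below. It is not the code used for the study: it draws a fresh sparse $\beta^*$ rather than using the fixed vector listed above, it obtains the covariance $\rho^{|i-j|}$ through an AR(1) recursion, and it reads the rescaling as $\|X\beta^*\|_2^2 = n$. The function name and the seed are hypothetical.

```python
import math
import random

def simulate(n=40, p=60, rho=0.9, sigma=0.2, n_nonzero=10, seed=0):
    """Draw (X, Y, beta_star) from a sparse linear model with
    Toeplitz-correlated Gaussian predictors."""
    rng = random.Random(seed)
    # AR(1) recursion: each row is Gaussian with covariance rho^|i-j|.
    innovation_sd = math.sqrt(1.0 - rho ** 2)
    X = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            row.append(rho * row[-1] + innovation_sd * rng.gauss(0.0, 1.0))
        X.append(row)
    # Sparse coefficient vector: Gaussian entries on a random support.
    beta = [0.0] * p
    for j in rng.sample(range(p), n_nonzero):
        beta[j] = rng.gauss(0.0, 1.0)
    # Rescale so that ||X beta||_2^2 = n (assumed reading of the rescaling).
    signal = [sum(x * b for x, b in zip(row, beta)) for row in X]
    scale = math.sqrt(n / sum(s * s for s in signal))
    beta = [b * scale for b in beta]
    signal = [s * scale for s in signal]
    # Response as in (9): signal plus sigma times standard Gaussian noise.
    Y = [s + sigma * rng.gauss(0.0, 1.0) for s in signal]
    return X, Y, beta

X, Y, beta_star = simulate()
```

Setups (a), (b) and (c) correspond to calling `simulate` with `rho` and `sigma` set to (0, 0.2), (0.9, 0.2) and (0.9, 0.6), respectively.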

Computing the solution path for both Lasso and L2Boosting took under 
half a second of CPU time each, using the LARS software in R of Efron et 
al. [8]. Computing the solution path of the Dantzig for 200 distinct values of 
the regularization parameter $\lambda$ took more than 30 seconds on the same com-
puter, using either a standard C linear programming library lp_solve (called
from R) or the Matlab code supplied in the $\ell_1$-magic package (Candes [4]).
The relatively long running time for the current Dantzig algorithms makes 
it necessary to develop a path-following algorithm. As mentioned before, the 
Dantzig path could have jumps and, as a result, its path-following algorithm 
could be somewhat more involved, as in Li and Zhu [12]. 
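
For a single value of $\lambda$, problem (5) can be handed to a generic LP solver. The sketch below uses SciPy's `linprog` as a stand-in for the lp_solve and $\ell_1$-magic routines mentioned above; it is an illustration under that substitution and not the code used for the timings. The function name `dantzig` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig(X, Y, lam):
    """Solve min ||b||_1 subject to ||X^T (Y - X b)||_inf <= lam as an LP.

    The coefficient vector is split as b = b_plus - b_minus with both parts
    nonnegative, which matches linprog's default variable bounds.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p = X.shape[1]
    G = X.T @ X          # Gram matrix
    z = X.T @ Y          # correlations of predictors with the response
    c = np.ones(2 * p)   # objective: sum(b_plus) + sum(b_minus)
    # z - G b <= lam  and  G b - z <= lam, in terms of (b_plus, b_minus).
    A_ub = np.vstack([np.hstack([-G, G]), np.hstack([G, -G])])
    b_ub = np.concatenate([lam - z, lam + z])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]

# Orthonormal toy design: the solution is soft-thresholding of Y at lam.
beta_hat = dantzig(np.eye(3), [3.0, -0.5, 1.5], lam=1.0)
```

Calling this function once per grid value is the brute-force approach to the path; a path-following algorithm would instead reuse the solution at one value of $\lambda$ to reach the next.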

Other simulations with different randomly chosen sparse $\beta^*$'s were con-
ducted and yielded similar results as was demonstrated with this particular
choice of $\beta^*$. In almost all cases, Lasso and L2Boosting outperform Dantzig
Fig. 2. Regularization paths from a single realization for each setup (a), (b) and (c) for
L2Boosting (first row), Lasso (second row) and Dantzig (third row). The Dantzig path is
jittery for a very correlated design (large value of $\rho$). The ends of the paths (for $\lambda \to 0$)
agree for Dantzig and Lasso.

and the Dantzig path is more jittery; when signal to noise ratio (SNR) is rel- 
atively high and the predictors are highly correlated, the performance gain 
of L2Boosting and Lasso over Dantzig cannot be ignored. 

Now let us look into the details of the results in Figures 2, 3 and 4. 
Figure 2 displays path plots under (a), (b) and (c) for a single realization of 
the linear model (9). The horizontal axes are scaled so that the path plots 
are comparable. All else being equal, a correlation increase
or an SNR decrease makes the path more jittery for all three methods,
to varying degrees. Across methods, L2Boosting's path is the smoothest,
Lasso's is less smooth and Dantzig's is the most jittery. Moreover, under the
same simulation setup, the branching points from zero of the three methods 
are quite similar although the path smoothness differs. 

Does the smoothness/jittery property of the path of a method readily 
translate into meaningful performance properties? Figures 3 and 4 attempt 
to answer this question. The first one shows that in terms of both MSE's, 
Lasso and L2Boosting are similar and in general better than Dantzig over 
the whole path. The improvement of Lasso or L2Boosting over Dantzig is 
Fig. 3. For the three setups (a), (b) and (c), the first row shows the MSE's on $\beta$ of the
Dantzig, the Lasso and L2Boosting solution as a function of the regularization parameter
$\lambda$, averaged over 50 simulations. All three methods perform approximately equally well,
with the exception of setting (b), where Dantzig performs worse. The vertical dotted line
indicates the proposed fixed value of $\lambda_p(\sigma)$. The second row compares the solutions obtained
by using the data-driven ($\lambda_{DD}$) and the cross-validation ($\lambda_{CV}$) tuning of the regularization
parameter. In general, cross-validation gives a better fit except for the third setting (c)
where the MSE on $\beta$ favors the conservative data-driven Dantzig selector. The next two
rows show comparable plots for the MSE's on $X\beta$. Here, the difference between all three
methods is even smaller. For all three setups, the cross-validation tuned regularization
parameter $\lambda_{CV}$ always results in a better MSE on $X\beta$ or a better predictive performance
than its data-driven Dantzig selector counterpart $\lambda_{DD}$.
Fig. 4. The average number of correctly selected variables as a function of the number
of falsely selected variables, averaged over 50 simulations. The straight line corresponds to
the performance under random selection of variables. Filled triangles indicate the solution
under $\lambda_{DD}$, whereas the solution for $\lambda_{CV}$ is marked by squares.



more pronounced for the MSE on $\beta$ than that on $X\beta$. The middle column
in Figure 3, with high correlation between predictors and high SNR, shows
the worst case for Dantzig, relative to L2Boosting and Lasso. Such results
are in terms of both MSE's, with the MSE for $\beta$ worse than the MSE for
$X\beta$. This indicates qualitatively a regime where, when correlation and SNR
are matched in some way, Dantzig is worse off than L2Boosting and Lasso.
In other words, Lasso and L2Boosting are more effective at extracting statis-
tical information. With the same high correlation, however, when the SNR
decreases (as shown in the right column of Figure 3), the statistical problem
becomes hard for all of them and the advantage of Lasso and L2Boosting
diminishes. For both MSE's, cross-validation selects better tuning param-
eters for all three methods than the data-driven Dantzig (DD) with the
exception of setup (c). In this setup, the noise level is high and so is the
correlation level; estimating individual $\beta$'s becomes difficult and hence it is
better to be conservative, as $\lambda_{DD}$ sets many $\beta$'s to zero (cf. the rightmost plot
in the second row of Figure 3). However, when the performance measure is
on prediction or the MSE on $X\beta$, $\lambda_{CV}$ does better again than $\lambda_{DD}$ (cf. the
rightmost plot in the fourth row of Figure 3).

Last but not least, we assess the model selection prospect of the three 
methods with the CV-selected or the DD-selected tuning parameter $\lambda$.
Figure 4 contains three plots under the three simulation setups. The horizontal
axis plots the number of falsely selected variables and the vertical gives the 
corresponding correctly selected variables. Within each plot, the straight 
line gives the result of random selection of predictors; the solid curved line 
is Dantzig, dashed line is Lasso and dotted line is L2Boosting. The triangles 
indicate the DD selection and squares the CV selection of tuning parame- 
ters, for each method depending on the curve where the symbol is sitting. 
Obviously, all methods do better than random selection and the gain is high- 
est when the predictors are not correlated. The gain is reduced when the 
correlation is high, but with a larger gain in the case of high SNR (middle
plot) than the low-SNR case (right plot). In particular, the most differentiat- 
ing case is setup (b): high correlation and high SNR. For all three methods, 
CV would pick up two or three more correct predictors with the same false 
predictors as random selection, and there is a slight but definite advantage 
of L2Boosting and Lasso over Dantzig. For high correlation and low SNR, 
only one or two correct ones can be gained over random selection of the 
same number of falsely selected predictors. Clearly, DD is very conservative,
selecting very few predictors for all three methods, while CV has a tendency
to include too many noise variables for low SNR; this is well known and 
has already been studied in more detail in Leng, Lin and Wahba [11] and 
Meinshausen and Bühlmann [13]. Nevertheless, for all three methods CV
seems to give a better balance on the total number of correct predictors and 
false predictors. For any choice of the regularization parameter, L2Boosting 
and Lasso are in general no worse and sometimes better than Dantzig. 

3. Concluding remarks. In this discussion, we have attempted to under- 
stand the Dantzig selector in relation to its cousins Lasso and L2Boosting. 
We believe that computing Dantzig or the Lasso for a single value of the 
penalty parameter $\lambda$ does not work well in practice; we need the entire
solution path to select a meaningful model with good predictive perfor- 
mance. Without a path-following algorithm, computing the solution path 
for Dantzig is computationally very intensive (which is the reason we were 
limited to rather small data sets for the numerical examples). Leaving aside 
computational aspects, the first visual impression of the Dantzig solution 
path is its jitteriness when compared to the much smoother Lasso or 
L2Boosting solution paths, especially for highly correlated predictor vari- 
ables. However, we showed that the smoothness of the path is not always 
indicative of performance. For the same regularization parameter, Lasso and 
L2Boosting performed in all settings at least as well as the Dantzig selector 
(and sometimes substantially better) and Dantzig performed on par with 
Lasso and L2Boosting for low signal to noise ratio even though its path is 
much more jittery. For almost all settings considered, the regularization pa- 
rameter selected by cross-validation gives better MSE's than the data-driven 
Dantzig selector. In summary, we have not yet seen compelling evidence 
that would persuade us to use the Dantzig in practice rather than Lasso or 
L2Boosting. 

Acknowledgment. We would like to thank Martin Wainwright for helpful 
comments on an earlier version of the discussion. 